System and method for specifying batch execution ordering of requests in a storage system cluster

ABSTRACT

A method for operating a computer data storage system is described. A plurality of requests are received from a client, each request of the plurality of requests having assigned a unique sequence number, each request being an input/output request to a data storage device. The plurality of requests is divided into a plurality of subsets of requests. A unique batch number is assigned to each subset of requests so that each subset of requests is assigned a unique batch number. A first subset of requests having a first batch number is executed in arbitrary order with respect to the sequence number of each request. A second subset of requests is executed in response to a second batch number after execution of all of the first subset of requests has completed.

CROSS-REFERENCE TO RELATED APPLICATION

The present application is a continuation of U.S. patent application Ser. No. 12/637,926, titled SYSTEM AND METHOD FOR SPECIFYING BATCH EXECUTION ORDERING OF REQUESTS IN A STORAGE SYSTEM CLUSTER, by Peter F. Corbett, filed on Dec. 15, 2009, which is a continuation of U.S. patent application Ser. No. 11/119,166, titled SYSTEM AND METHOD FOR SPECIFYING BATCH EXECUTION ORDERING OF REQUESTS IN A STORAGE SYSTEM CLUSTER, by Peter F. Corbett, filed on Apr. 29, 2005, now issued as U.S. Pat. No. 7,657,537 on Feb. 2, 2010, which is related to U.S. Pat. No. 7,443,872, entitled SYSTEM AND METHOD FOR MULTIPLEXING CHANNELS OVER MULTIPLE CONNECTIONS IN A STORAGE SYSTEM CLUSTER. These patents are hereby incorporated by reference.

FIELD OF THE INVENTION

The present invention is directed to network protocols and, in particular, to ordering of message operation execution in accordance with a network protocol executing on a storage system cluster.

BACKGROUND OF THE INVENTION

A storage system typically comprises one or more storage devices into which information may be entered, and from which information may be obtained, as desired. The storage system includes a storage operating system that functionally organizes the system by, inter alia, invoking storage operations in support of a storage service implemented by the system. The storage system may be implemented in accordance with a variety of storage architectures including, but not limited to, a network-attached storage environment, a storage area network and a disk assembly directly attached to a client or host computer. The storage devices are typically disk drives organized as a disk array, wherein the term “disk” commonly describes a self-contained rotating magnetic media storage device. The term disk in this context is synonymous with hard disk drive (HDD) or direct access storage device (DASD).

The storage system may be further configured to operate according to a client/server model of information delivery to thereby allow many clients to access data containers, such as files and logical units, stored on the system. In this model, the client may comprise an application, such as a database application, executing on a computer that “connects” to the storage system over a computer network, such as a point-to-point link, shared local area network (LAN), wide area network (WAN), or virtual private network (VPN) implemented over a public network such as the Internet. Each client may request the services of the storage system by issuing file-based and block-based protocol messages (in the form of packets) to the system over the network.

A plurality of storage systems may be interconnected to provide a storage system cluster configured to service many clients. Each storage system or node may be configured to service one or more volumes, wherein each volume stores one or more data containers. Communication among the nodes involves the exchange of information between two or more entities interconnected by communication links. These entities are typically software programs executing on the nodes. The nodes communicate by exchanging discrete packets or messages of information according to predefined protocols. In this context, a protocol consists of a set of rules defining how the nodes interact with each other.

Each node generally provides its services through the execution of software modules, such as processes. A process is a software program that is defined by a memory address space. For example, an operating system of the node may be implemented as a single process with a large memory address space, wherein pieces of code within the process provide operating system services, such as process management. Yet, the node's services may also be implemented as separately-scheduled processes in distinct, protected address spaces. These separate processes, each with its own process address space, execute on the node to manage resources internal to the node and, in the case of a database or network protocol, to interact with various network entities.

Services that are part of the same process address space communicate by accessing the same memory space. That is, information exchanged between services implemented in the same process address space is not transferred, but rather may be accessed in a common memory. However, communication among services that are implemented as separate processes is typically effected by the exchange of messages. For example, information exchanged between different address spaces of processes is transferred as one or more messages between different memory spaces of the processes. A known message-passing mechanism provided by an operating system to transfer information between process address spaces is the Inter Process Communication (IPC) mechanism.

Resources internal to the node may include communication resources that enable a process on one node to communicate over the communication links or network with another process on a different node. The communication resources include the allocation of memory and data structures, such as messages, as well as a network protocol stack. The network protocol stack, in turn, comprises layers of software, such as a session layer, a transport layer and a network layer. The Internet protocol (IP) is a network layer protocol that provides network addressing between nodes, whereas the transport layer provides a port service that identifies each process executing on the nodes and creates a connection between those processes that indicate a willingness to communicate. Examples of conventional transport layer protocols include the reliable connection (RC) protocol and the Transmission Control Protocol (TCP).

Broadly stated, the connection provided by the transport layer, such as that provided by TCP, is a reliable, securable logical circuit between pairs of processes. A TCP process executing on each node establishes the TCP connection in accordance with a conventional “3-way handshake” arrangement involving the exchange of TCP message or segment data structures. The resulting TCP connection is identified by port numbers and IP addresses of the nodes. The TCP transport service provides reliable delivery of a message using a TCP transport header. The TCP protocol and establishment of a TCP connection are described in Computer Networks, 3rd Edition, particularly at pgs. 521-542, which is hereby incorporated by reference as though fully set forth herein.

Flow control is a protocol function that controls the flow of data between network protocol stack layers in communicating nodes. At the transport layer, for example, flow control restricts the flow of data (e.g., bytes) over a connection between the nodes. The transport layer may employ a fixed sliding-window mechanism that specifies the number of bytes that can be exchanged over the network (communication link) before acknowledgement is required. Typically, the mechanism includes a fixed-size window or buffer that stores the data bytes and that is advanced by the acknowledgements.

The session layer manages the establishment or binding of an association between two communicating processes in the nodes. In this context, the association is a session comprising a series of interactions between the two communicating processes for a period of time, e.g., during the span of a connection. Upon establishment of the connection, the processes take turns exchanging commands and data over the session, typically through the use of request and response messages. Flow control in the session layer concerns the number of outstanding request messages (requests) that is allowed over the session at a time. Laggard response messages (responses) or long-running requests may force the institution of session layer flow control to limit the flow of requests between the processes, thereby adversely impacting the session.

A solution that enables a session to continue to perform at high throughput even in the event of a long-running request or a lost request or response is described in the above-referenced U.S. Pat. No. 7,443,872 entitled SYSTEM AND METHOD FOR MULTIPLEXING CHANNELS OVER MULTIPLE CONNECTIONS IN A STORAGE SYSTEM CLUSTER. Here, a network protocol employs multiple request channels within a session to allow high levels of concurrency, i.e., to allow a large number of requests to be outstanding within each channel. Multiple channels further allow a plurality of sessions to be multiplexed over the connections to thereby insulate the sessions from lost throughput due to laggard responses or long-running requests.

Broadly stated, each channel is embodied as a request window that stores outstanding requests sent over the connection. Each request window has a predetermined initial sequence window size and the total number of outstanding requests in a session is the sum of the window sizes of all the channels in the session. In addition, each request has a sequence number that is unique for that request and specifies its sequence in the channel. Coupling the sequence number with a defined sequence window size provides flow and congestion control, limiting the number of outstanding requests in the channel. However, if the sequence number is also used to specify an order of execution of requests, then no requests can be executed out-of-order or concurrently within the channel. Requests on different channels can be executed concurrently or out-of-order with respect to each other, but there is no way to enforce an ordering of the requests in different channels with respect to each other. It is desirable to be able to specify that a number of requests can be executed in arbitrary order, but then occasionally insert a barrier that requires that all requests up to a certain point must be executed before any request after that point. Additionally, it is desirable to specify an exact order of execution, while occasionally allowing out of order execution or, alternately, to permit any intermediate degree of control from completely ordered execution to completely arbitrary execution ordering.
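The request window described above can be sketched as follows. This is a minimal illustration rather than the protocol's actual implementation: the class name Channel, its fields, and the rule that the window slides only when the lowest outstanding request is answered are all assumptions made for the example.

```python
class Channel:
    """Hypothetical request window for one channel (all names are illustrative)."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.base = 0          # lowest sequence number not yet answered
        self.next_seq = 0      # sequence number for the next new request
        self.answered = set()  # responses received ahead of `base`

    def can_send(self):
        # A new request fits only if its sequence number lies inside the window.
        return self.next_seq < self.base + self.window_size

    def send(self):
        """Return the sequence number of a new request, or None if the window is full."""
        if not self.can_send():
            return None
        seq = self.next_seq
        self.next_seq += 1
        return seq

    def on_response(self, seq):
        """Record a response and slide the window past contiguous answered requests."""
        self.answered.add(seq)
        while self.base in self.answered:
            self.answered.remove(self.base)
            self.base += 1


def session_capacity(channels):
    # Total outstanding requests in a session: the sum of the channel window sizes.
    return sum(c.window_size for c in channels)
```

In this sketch a single late response to the lowest outstanding request holds the window in place, which is the kind of stall that motivates spreading requests over several channels.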

SUMMARY OF THE INVENTION

The present invention overcomes the disadvantages of the prior art by providing a system and method for specifying batch execution ordering of requests in a cluster of storage systems or nodes. Each node is generally organized as a network element and a disk element. Each element includes a cluster fabric interface module adapted to implement a network protocol, which integrates a session infrastructure and an application operation set into a session layer. The network protocol is illustratively a request/response protocol wherein an element (requester) receiving a data access request from a client redirects that request to another element (responder) that services the request and, upon completion, returns a response.

In the illustrative embodiment, the session layer manages the establishment and termination of sessions between requesters/responders in the cluster and is built upon a connection layer that establishes connections between the requesters/responders. Each session comprises a plurality of channels disposed over the connections, wherein each channel enables multiple requests to be sent over a connection. Each request is identified by a unique identifier (“request id”) that is generally defined as the combination of a channel number and a sequence number. To that end, each channel is identified by a channel number, which is unique within the direction of request flow in the session. In addition, each request has a sequence number that is unique for that request and specifies its sequence in the channel.

According to an aspect of the invention, the request id is extended to include a batch number that provides an execution ordering directive within a channel. That is, each request is also assigned a batch number used to impose ordering of the request within the channel. All requests with the same batch number in a channel can be executed in arbitrary order or concurrently by the responder. Ordering is imposed only when the batch number changes, e.g., increases. Illustratively, the batch number increases monotonically with increasing sequence number. Although more than one request in a channel can have the same batch number, all requests with the same batch number are executed before any request with a higher batch number.
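The ordering rule above can be illustrated with a short sketch. It assumes that all requests of a channel are already in hand, and the names Request, assign_batches and execution_order are hypothetical, chosen only for this example. Batch numbers are made to increase with sequence number, and requests within one batch are shuffled to show that any order inside a batch is acceptable.

```python
import random
from itertools import groupby
from typing import NamedTuple


class Request(NamedTuple):
    channel: int   # channel number
    sequence: int  # sequence number within the channel
    batch: int     # batch number (hypothetical field layout)


def assign_batches(channel, barriers, count):
    """Build `count` requests; a new batch starts at each sequence number in `barriers`."""
    requests, batch = [], 0
    for seq in range(count):
        if seq in barriers:
            batch += 1  # batch number never decreases as the sequence number grows
        requests.append(Request(channel, seq, batch))
    return requests


def execution_order(requests, rng=None):
    """Return one legal execution order: batches in increasing order, arbitrary order inside."""
    if rng is None:
        rng = random.Random()
    order = []
    by_batch = sorted(requests, key=lambda r: r.batch)
    for _, group in groupby(by_batch, key=lambda r: r.batch):
        members = list(group)
        rng.shuffle(members)   # any order within a batch is acceptable
        order.extend(members)  # the next batch starts only after this one is placed
    return order
```

For example, assign_batches(1, {3, 5}, 7) yields seven requests whose batch numbers are 0, 0, 0, 1, 1, 2, 2, so sequence numbers 3 and 5 act as barriers.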

Advantageously, batch execution ordering allows multiple requests to be executed concurrently or out of sequence, while explicitly requiring ordering among subsets of requests. That is, the use of batch numbers within a channel allows imposition of an ordering constraint on requests in the channel, as well as issuance of multiple unordered requests in the channel. Moreover, layering of a batch number on a request id allows immediate and certain detection of a boundary between batches with no danger of error. In other words, the batch number enables a responder to determine whether a request can be immediately executed or must be stalled, and this determination can always be made optimally based on other requests received at that point.
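One way a responder might make the execute-or-stall decision is sketched below. The rule shown is an assumption for illustration, not the procedure of the described embodiment: it relies only on the stated property that batch numbers never decrease as sequence numbers increase, and it additionally assumes that sequence numbers in a channel are consecutive integers starting at first_seq. The class and method names are hypothetical.

```python
class ChannelState:
    """Hypothetical responder-side bookkeeping for one channel."""

    def __init__(self, first_seq=0):
        self.first_seq = first_seq
        self.received = {}      # sequence number -> batch number
        self.completed = set()  # sequence numbers whose execution has finished

    def receive(self, seq, batch):
        self.received[seq] = batch

    def complete(self, seq):
        self.completed.add(seq)

    def can_execute(self, seq):
        """True if the received request `seq` may run now, False if it must stall."""
        batch = self.received[seq]
        # Lowest received sequence number known to carry the same batch number.
        # Every request between it and `seq` must share that batch number,
        # because batch numbers do not decrease with sequence number.
        first_in_batch = min(s for s, b in self.received.items() if b == batch)
        # Anything earlier is in a lower batch or has not arrived yet,
        # so it must have completed before this request may run.
        return all(s in self.completed
                   for s in range(self.first_seq, first_in_batch))
```

Under these assumptions a request can run even when an earlier request of the same batch has not yet arrived, provided some still earlier request with the same batch number has been seen.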

BRIEF DESCRIPTION OF THE DRAWINGS

The above and further advantages of the invention may be better understood by referring to the following description in conjunction with the accompanying drawings in which like reference numerals indicate identical or functionally similar elements:

FIG. 1 is a schematic block diagram of a plurality of nodes interconnected as a cluster in accordance with an embodiment of the present invention;

FIG. 2 is a schematic block diagram of a node in accordance with an embodiment of the present invention;

FIG. 3 is a schematic block diagram of a storage operating system that may be advantageously used with the present invention;

FIG. 4 is a schematic block diagram illustrating the format of a SpinNP message in accordance with an embodiment of the present invention;

FIG. 5 is a schematic block diagram illustrating the organization of cluster fabric interface modules adapted to implement a SpinNP protocol in accordance with an embodiment of the present invention;

FIG. 6 is a schematic block diagram illustrating channels of a session in accordance with an embodiment of the present invention;

FIG. 7 is a schematic block diagram illustrating the use of batch numbers within a channel of the session in accordance with the present invention;

FIG. 8 is a flowchart illustrating a procedure for specifying batch execution ordering in accordance with the present invention;

FIG. 9A is a flowchart illustrating a procedure for processing received batch execution ordered requests in accordance with the present invention;

FIG. 9B is a flowchart illustrating a procedure for processing received batch execution ordered requests in accordance with the present invention; and

FIG. 10 is a flowchart illustrating a procedure for processing requests in accordance with the present invention.

DETAILED DESCRIPTION OF AN ILLUSTRATIVE EMBODIMENT

A. Cluster Environment

FIG. 1 is a schematic block diagram of a plurality of nodes 200 interconnected as a cluster 100 and configured to provide storage service relating to the organization of information on storage devices. The nodes 200 comprise various functional components that cooperate to provide a distributed storage system architecture of the cluster 100. To that end, each node 200 is generally organized as a network element (N-blade 310) and a disk element (D-blade 350). The N-blade 310 includes functionality that enables the node 200 to connect to clients 180 over a computer network 140, while each D-blade 350 connects to one or more storage devices, such as disks 130 of a disk array 120. The nodes 200 are interconnected by a cluster switching fabric 150 which, in the illustrative embodiment, may be embodied as a Gigabit Ethernet switch. An exemplary distributed file system architecture is generally described in U.S. Pat. No. 6,671,773 titled METHOD AND SYSTEM FOR RESPONDING TO FILE SYSTEM REQUESTS, by M. Kazar et al., issued Dec. 30, 2003. It should be noted that while there is shown an equal number of N and D-blades in the illustrative cluster 100, there may be differing numbers of N and/or D-blades in accordance with various embodiments of the present invention. For example, there may be a plurality of N-blades and/or D-blades interconnected in a cluster configuration 100 that does not reflect a one-to-one correspondence between the N and D-blades. As such, the description of a node 200 comprising one N-blade and one D-blade should be taken as illustrative only.

The clients 180 may be general-purpose computers configured to interact with the node 200 in accordance with a client/server model of information delivery. That is, each client may request the services of the node, and the node may return the results of the services requested by the client, by exchanging packets over the network 140. The client may issue packets including file-based access protocols, such as the Common Internet File System (CIFS) protocol or Network File System (NFS) protocol, over the Transmission Control Protocol/Internet Protocol (TCP/IP) when accessing information in the form of files and directories. Alternatively, the client may issue packets including block-based access protocols, such as the Small Computer Systems Interface (SCSI) protocol encapsulated over TCP (iSCSI) and SCSI encapsulated over Fibre Channel (FCP), when accessing information in the form of blocks.

B. Storage System Node

FIG. 2 is a schematic block diagram of a node 200 that is illustratively embodied as a storage system comprising a plurality of processors 222 a,b, a memory 224, a network adapter 225, a cluster access adapter 226, a storage adapter 228 and local storage 230 interconnected by a system bus 223. The local storage 230 comprises one or more storage devices, such as disks, utilized by the node to locally store configuration information (e.g., in configuration table 235) provided by one or more management processes that execute as user mode applications. The cluster access adapter 226 comprises a plurality of ports adapted to couple the node 200 to other nodes of the cluster 100. In the illustrative embodiment, Ethernet is used as the clustering protocol and interconnect media, although it will be apparent to those skilled in the art that other types of protocols and interconnects may be utilized within the cluster architecture described herein. In alternate embodiments where the N-blades and D-blades are implemented on separate storage systems or computers, the cluster access adapter 226 is utilized by the N/D-blade for communicating with other N/D-blades in the cluster 100.

Each node 200 is illustratively embodied as a dual processor storage system executing a storage operating system 300 that preferably implements a high-level module, such as a file system, to logically organize the information as a hierarchical structure of named directories, files and special types of files called virtual disks (hereinafter generally “blocks”) on the disks. However, it will be apparent to those of ordinary skill in the art that the node 200 may alternatively comprise a single processor system or a system with more than two processors. Illustratively, one processor 222 a executes the functions of the N-blade 310 on the node, while the other processor 222 b executes the functions of the D-blade 350.

The memory 224 illustratively comprises storage locations that are addressable by the processors and adapters for storing software program code and data structures associated with the present invention. The processor and adapters may, in turn, comprise processing elements and/or logic circuitry configured to execute the software code and manipulate the data structures. The storage operating system 300, portions of which are typically resident in memory and executed by the processing elements, functionally organizes the node 200 by, inter alia, invoking storage operations in support of the storage service implemented by the node. It will be apparent to those skilled in the art that other processing and memory means, including various computer readable media, may be used for storing and executing program instructions pertaining to the invention described herein.

The network adapter 225 comprises a plurality of ports adapted to couple the node 200 to one or more clients 180 over point-to-point links, wide area networks, virtual private networks implemented over a public network (Internet) or a shared local area network. The network adapter 225 thus may comprise the mechanical, electrical and signaling circuitry needed to connect the node to the network. Illustratively, the computer network 140 may be embodied as an Ethernet network or a Fibre Channel (FC) network. Each client 180 may communicate with the node over network 140 by exchanging discrete frames or packets of data according to pre-defined protocols, such as TCP/IP.

The storage adapter 228 cooperates with the storage operating system 300 executing on the node 200 to access information requested by the clients. The information may be stored on any type of attached array of writable storage device media such as video tape, optical, DVD, magnetic tape, bubble memory, electronic random access memory, micro-electromechanical and any other similar media adapted to store information, including data and parity information. However, as illustratively described herein, the information is preferably stored on the disks 130 of array 120. The storage adapter comprises a plurality of ports having input/output (I/O) interface circuitry that couples to the disks over an I/O interconnect arrangement, such as a conventional high-performance, FC link topology.

Storage of information on each array 120 is preferably implemented as one or more storage “volumes” that comprise a collection of physical storage disks 130 cooperating to define an overall logical arrangement of volume block number (vbn) space on the volume(s). Each logical volume is generally, although not necessarily, associated with its own file system. The disks within a logical volume/file system are typically organized as one or more groups, wherein each group may be operated as a Redundant Array of Independent (or Inexpensive) Disks (RAID). Most RAID implementations, such as a RAID-4 level implementation, enhance the reliability/integrity of data storage through the redundant writing of data “stripes” across a given number of physical disks in the RAID group, and the appropriate storing of parity information with respect to the striped data. An illustrative example of a RAID implementation is a RAID-4 level implementation, although it should be understood that other types and levels of RAID implementations may be used in accordance with the inventive principles described herein.
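The role of parity in a RAID-4 style stripe can be shown with a brief sketch. It assumes the usual byte-wise XOR parity kept on a dedicated parity disk; the function names and the tiny two-byte blocks are purely illustrative and are not taken from the described RAID system 380.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_for(stripe):
    # Parity block for one stripe: the XOR of its data blocks.
    return xor_blocks(stripe)


def reconstruct(surviving, parity):
    # A single lost data block is the XOR of the surviving blocks and the parity.
    return xor_blocks(surviving + [parity])
```

With a stripe of three data blocks, losing any one of them leaves enough information in the other two plus the parity block to rebuild it, which is how the redundancy described above protects the striped data.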

C. Storage Operating System

To facilitate access to the disks 130, the storage operating system 300 implements a write-anywhere file system that cooperates with one or more virtualization modules to “virtualize” the storage space provided by disks 130. The file system logically organizes the information as a hierarchical structure of named directories and files on the disks. Each “on-disk” file may be implemented as a set of disk blocks configured to store information, such as data, whereas the directory may be implemented as a specially formatted file in which names and links to other files and directories are stored. The virtualization module(s) allow the file system to further logically organize information as a hierarchical structure of blocks on the disks that are exported as named logical unit numbers (luns).

In the illustrative embodiment, the storage operating system is preferably the NetApp® Data ONTAP™ operating system available from Network Appliance, Inc., Sunnyvale, Calif. that implements a Write Anywhere File Layout (WAFL™) file system. However, it is expressly contemplated that any appropriate storage operating system may be enhanced for use in accordance with the inventive principles described herein. As such, where the term “WAFL” is employed, it should be taken broadly to refer to any storage operating system that is otherwise adaptable to the teachings of this invention.

FIG. 3 is a schematic block diagram of the storage operating system 300 that may be advantageously used with the present invention. The storage operating system comprises a series of software layers organized to form an integrated network protocol stack or, more generally, a multi-protocol engine 325 that provides data paths for clients to access information stored on the node using block and file access protocols. The multi-protocol engine includes a media access layer 312 of network drivers (e.g., gigabit Ethernet drivers) that interfaces to network protocol layers, such as the IP layer 314 and its supporting transport mechanisms, the TCP layer 316 and the User Datagram Protocol (UDP) layer 315. A file system protocol layer provides multi-protocol file access and, to that end, includes support for the Direct Access File System (DAFS) protocol 318, the NFS protocol 320, the CIFS protocol 322 and the Hypertext Transfer Protocol (HTTP) protocol 324. A VI layer 326 implements the VI architecture to provide direct access transport (DAT) capabilities, such as RDMA, as required by the DAFS protocol 318. An iSCSI driver layer 328 provides block protocol access over the TCP/IP network protocol layers, while a FC driver layer 330 receives and transmits block access requests and responses to and from the node. The FC and iSCSI drivers provide FC-specific and iSCSI-specific access control to the blocks and, thus, manage exports of luns to either iSCSI or FCP or, alternatively, to both iSCSI and FCP when accessing the blocks on the node 200.

In addition, the storage operating system includes a series of software layers organized to form a storage server 365 that provides data paths for accessing information stored on the disks 130 of the node 200. To that end, the storage server 365 includes a file system module 360 in cooperating relation with a volume striping module (VSM) 370, a RAID system module 380 and a disk driver system module 390. The RAID system 380 manages the storage and retrieval of information to and from the volumes/disks in accordance with I/O operations, while the disk driver system 390 implements a disk access protocol such as, e.g., the SCSI protocol. The VSM 370 illustratively implements a striped volume set (SVS) and cooperates with the file system 360 to enable storage server 365 to service a volume of the SVS. In particular, the VSM 370 implements a Locate( ) function 375 to compute the location of data container content in the SVS volume to thereby ensure consistency of such content served by the cluster.

The file system 360 implements a virtualization system of the storage operating system 300 through the interaction with one or more virtualization modules illustratively embodied as, e.g., a virtual disk (vdisk) module (not shown) and a SCSI target module 335. The vdisk module enables access by administrative interfaces, such as a user interface of a management framework (not shown), in response to a user (system administrator) issuing commands to the node 200. The SCSI target module 335 is generally disposed between the FC and iSCSI drivers 328, 330 and the file system 360 to provide a translation layer of the virtualization system between the block (lun) space and the file system space, where luns are represented as blocks.

The file system 360 is illustratively a message-based system that provides logical volume management capabilities for use in access to the information stored on the storage devices, such as disks. That is, in addition to providing file system semantics, the file system 360 provides functions normally associated with a volume manager. These functions include (i) aggregation of the disks, (ii) aggregation of storage bandwidth of the disks, and (iii) reliability guarantees, such as mirroring and/or parity (RAID). The file system 360 illustratively implements the WAFL file system (hereinafter generally the “write-anywhere file system”) having an on-disk format representation that is block-based using, e.g., 4 kilobyte (kB) blocks and using index nodes (“inodes”) to identify files and file attributes (such as creation time, access permissions, size and block location). The file system uses files to store meta-data describing the layout of its file system; these meta-data files include, among others, an inode file. A file handle, i.e., an identifier that includes an inode number, is used to retrieve an inode from disk.

Broadly stated, all inodes of the write-anywhere file system are organized into the inode file. A file system (fs) info block specifies the layout of information in the file system and includes an inode of a file that includes all other inodes of the file system. Each logical volume (file system) has an fsinfo block that is preferably stored at a fixed location within, e.g., a RAID group. The inode of the inode file may directly reference (point to) data blocks of the inode file or may reference indirect blocks of the inode file that, in turn, reference data blocks of the inode file. Within each data block of the inode file are embedded inodes, each of which may reference indirect blocks that, in turn, reference data blocks of a file.

Operationally, a request from the client 180 is forwarded as a packet over the computer network 140 and onto the node 200 where it is received at the network adapter 225. A network driver (of layer 312 or layer 330) processes the packet and, if appropriate, passes it on to a network protocol and file access layer for additional processing prior to forwarding to the write-anywhere file system 360. Here, the file system generates operations to load (retrieve) the requested data from disk 130 if it is not resident “in core”, i.e., in memory 224. If the information is not in memory, the file system 360 indexes into the inode file using the inode number to access an appropriate entry and retrieve a logical vbn. The file system then passes a message structure including the logical vbn to the RAID system 380; the logical vbn is mapped to a disk identifier and disk block number (disk,dbn) and sent to an appropriate driver (e.g., SCSI) of the disk driver system 390. The disk driver accesses the dbn from the specified disk 130 and loads the requested data block(s) in memory for processing by the node. Upon completion of the request, the node (and operating system) returns a reply to the client 180 over the network 140.
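The read path just described can be traced with a toy model. Everything below is a simplification made for illustration: the class ReadPath and its fields are hypothetical, each file is assumed to occupy a single block, and the round-robin mapping from a logical vbn to (disk,dbn) is an invented placeholder, since the text does not say how the RAID system 380 performs that mapping.

```python
class ReadPath:
    """Toy model of the lookup chain: in-core check, inode file, vbn, (disk, dbn)."""

    def __init__(self, inode_file, disks, num_disks):
        self.inode_file = inode_file  # inode number -> logical vbn (one block per file)
        self.disks = disks            # (disk, dbn) -> block contents
        self.num_disks = num_disks
        self.in_core = {}             # inode number -> block already loaded in memory

    def vbn_to_disk_dbn(self, vbn):
        # Hypothetical round-robin mapping, used here only as a placeholder.
        return vbn % self.num_disks, vbn // self.num_disks

    def read(self, inode_number):
        # Serve from memory when the block is already resident "in core".
        if inode_number in self.in_core:
            return self.in_core[inode_number]
        vbn = self.inode_file[inode_number]     # index into the inode file
        disk, dbn = self.vbn_to_disk_dbn(vbn)   # mapping step performed by the RAID layer
        block = self.disks[(disk, dbn)]         # disk driver fetches the block
        self.in_core[inode_number] = block      # keep it in memory for later requests
        return block
```

A second read of the same inode number in this model is answered from memory without touching the disk table, mirroring the "in core" check in the text.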

It should be noted that the software “path” through the storage operating system layers described above needed to perform data storage access for the client request received at the node may alternatively be implemented in hardware. That is, in an alternate embodiment of the invention, a storage access request data path may be implemented as logic circuitry embodied within a field programmable gate array (FPGA) or an application specific integrated circuit (ASIC). This type of hardware implementation increases the performance of the storage service provided by node 200 in response to a request issued by client 180. Moreover, in another alternate embodiment of the invention, the processing elements of adapters 225, 228 may be configured to offload some or all of the packet processing and storage access operations, respectively, from processor 222, to thereby increase the performance of the storage service provided by the node. It is expressly contemplated that the various processes, architectures and procedures described herein can be implemented in hardware, firmware or software.

As used herein, the term “storage operating system” generally refers to the computer-executable code operable on a computer to perform a storage function that manages data access and may, in the case of a node 200, implement data access semantics of a general purpose operating system. The storage operating system can also be implemented as a microkernel, an application program operating over a general-purpose operating system, such as UNIX® or Windows NT®, or as a general-purpose operating system with configurable functionality, which is configured for storage applications as described herein.

In addition, it will be understood by those skilled in the art that the invention described herein may apply to any type of special-purpose (e.g., file server, filer or storage serving appliance) or general-purpose computer, including a standalone computer or portion thereof, embodied as or including a storage system. Moreover, the teachings of this invention can be adapted to a variety of storage system architectures including, but not limited to, a network-attached storage environment, a storage area network and disk assembly directly-attached to a client or host computer. The term “storage system” should therefore be taken broadly to include such arrangements in addition to any subsystems configured to perform a storage function and associated with other equipment or systems. It should be noted that while this description is written in terms of a write anywhere file system, the teachings of the present invention may be utilized with any suitable file system, including a write in place file system.

D. SpinNP Network Protocol

In the illustrative embodiment, the storage server 365 is embodied as D-blade 350 of the storage operating system 300 to service one or more volumes of array 120. In addition, the multi-protocol engine 325 is embodied as N-blade 310 to (i) perform protocol termination with respect to a client issuing incoming data access request packets over the network 140, as well as (ii) redirect those data access requests to any storage server 365 of the cluster 100. Moreover, the N-blade 310 and D-blade 350 cooperate to provide a highly-scalable, distributed storage system architecture of the cluster 100. To that end, each blade includes a cluster fabric (CF) interface module 500 a,b adapted to implement a network protocol that enables intra-cluster communication among the blades, as described herein.

The protocol layers, e.g., the NFS/CIFS layers and the iSCSI/FC layers, of the N-blade 310 function as protocol servers that translate file-based and block-based data access requests from clients into network protocol messages used for communication with the D-blade 350. That is, the N-blade servers convert the incoming data access requests into primitive operations (commands) that are embedded within messages by the CF interface module 500 for transmission to the D-blades 350 of the cluster 100. Notably, the CF interface modules 500 cooperate to provide a single file system image across all D-blades 350 in the cluster 100. Thus, any network port of an N-blade that receives a client request can access any data container within the single file system image located on any D-blade 350 of the cluster.

Further to the illustrative embodiment, the N-blade 310 and D-blade 350 are implemented as separately-scheduled processes of storage operating system 300; however, in an alternate embodiment, the blades may be implemented as pieces of code within a single operating system process. Communication between an N-blade and D-blade is thus illustratively effected through the use of message passing between the blades although, in the case of remote communication between an N-blade and D-blade of different nodes, such message passing occurs over the cluster switching fabric 150. A known message-passing mechanism provided by the storage operating system to transfer information between blades (processes) is the Inter Process Communication (IPC) mechanism.

The network protocol illustratively described herein is the Spin network protocol (SpinNP) that comprises a collection of methods/functions constituting a SpinNP application programming interface (API). SpinNP is a proprietary protocol of Network Appliance of Sunnyvale, Calif. The term SpinNP is used herein without derogation of any trademark rights of Network Appliance, Inc. The SpinNP API, in this context, is a set of software calls and routines that are made available (exported) by a process and that can be referenced by other processes. As described herein, all SpinNP protocol communication in the cluster occurs via connections. Communication is illustratively effected by the D-blade exposing the SpinNP API to which an N-blade (or another D-blade) issues calls. To that end, the CF interface module 500 is organized as a CF encoder and CF decoder. The CF encoder of, e.g., CF interface 500 a on N-blade 310 encapsulates a SpinNP message as (i) a local procedure call (LPC) when communicating a command to a D-blade 350 residing on the same node 200 or (ii) a remote procedure call (RPC) when communicating the command to a D-blade residing on a remote node of the cluster 100. In either case, the CF decoder of CF interface 500 b on D-blade 350 de-encapsulates the SpinNP message and processes the command.

FIG. 4 is a schematic block diagram illustrating the format of a SpinNP message 400 in accordance with an embodiment of the present invention. The SpinNP message 400 is illustratively used for RPC communication over the switching fabric 150 between remote blades of the cluster 100; however, it should be understood that the term “SpinNP message” may be used generally to refer to LPC and RPC communication between blades of the cluster. The SpinNP message 400 includes a media access layer 402, an IP layer 404, a UDP layer 406, a reliable transport layer, such as a reliable connection (RC) layer 408, and a SpinNP protocol layer 410. As noted, the SpinNP protocol conveys commands related to operations contained within, e.g., client requests to access data containers stored on the cluster 100; the SpinNP protocol layer 410 is that portion of message 400 that carries those commands. Illustratively, the SpinNP protocol is datagram based and, as such, involves transmission of messages or “envelopes” in a reliable manner from a sender (e.g., an N-blade 310) to a receiver (e.g., a D-blade 350). The RC layer 408 implements a reliable transport protocol that is adapted to process such envelopes in accordance with a connectionless protocol, such as UDP 406.

According to the invention, the SpinNP network protocol is a multi-layered protocol that integrates a session infrastructure and an application operation set into a session layer that obviates encapsulation and buffering overhead typically associated with protocol layering. The session layer manages the establishment and termination of sessions between blades in the cluster and is illustratively built upon a connection layer that defines a set of functionality or services provided by a connection-oriented protocol. The connection-oriented protocol may include a framing protocol layer over a network transport, such as RC and/or TCP, or a memory-based IPC protocol. These connections are formed via the network transport, or via the local memory-to-memory or adapter-to-memory transport, and provide a packet/message transport service with flow control. It should be noted that other connection-oriented protocols, perhaps over other transports, can be used, as long as those transports provide the same minimum guaranteed functionality, e.g., reliable message delivery.

The SpinNP network protocol is illustratively a request/response protocol wherein a blade (requester) receiving a data access request from a client redirects that request to another blade (responder) that services the request and, upon completion, returns a response. The network protocol is illustratively implemented by the CF interface modules 500 and, as such, a SpinNP session provides a context for uni-directional flow of request messages (requests) and uni-directional flow of corresponding response messages (responses) to those requests. Each request consists of one SpinNP message and generates one response, unless the connection is lost or the session terminates abnormally. FIG. 5 is a schematic block diagram illustrating the organization of the CF interface modules 500 a,b adapted to implement the SpinNP protocol in accordance with an embodiment of the present invention. Each module 500 a,b comprises a SpinNP session layer 510 a,b and a connection layer 550 a,b.

The SpinNP session layer 510 allows implementation of different operation protocols, hereinafter referred to generally as “operation interfaces”. Examples of such interfaces include a session interface 512 that defines a set of protocol operations that is used to provide the session infrastructure and a file operations interface 514 that defines file access operations that are generally translated requests coming from external clients. Other interfaces implemented by the session layer include those used by data management, system management or other “application” subsets of cluster functionality, as needed. Notably, the session infrastructure operations exist in the network protocol at the same level of encapsulation as the application operations to enable an efficient and highly functional implementation. All interfaces share common features of the session layer, including credentials, authentication, verification, sessions, recovery, and response caches. Each operation provided by an interface is illustratively defined by an interface number coupled with a procedure number.

As noted, the SpinNP network protocol 410 relies on connections for reliable message delivery. As such, a session 600 is disposed over one or more connections 560 and is illustratively established between a pair of blades or other participants. For example, a session can be established between D-blades, between an N-blade and a D-blade, and between N-blades (if there proves to be a need for N-blade-to-N-blade SpinNP calls). The session can also be used to inter-connect other entities or agents, including user-space processes and services, to blades or to each other. Each pair of blades typically requires only one session to communicate; however, multiple sessions can be opened simultaneously between the same pair of blades. Each session requires bi-directional request flow over the same connection. The session 600 also provides an infrastructure that makes messages secure and supports recovery without requiring an additional protocol layer between the network transport layer (RC or TCP) and the application layer (e.g., file access operations). Each session is independently negotiated and initiated to thereby enable a high level of message concurrency and asynchrony.

The connections 560 are established by the connection layers 550 a,b and provide the network transport for the sessions between the blades. At least one connection is required for each session, wherein the connection is used for both requests and responses. Although more than one connection can be bound to a session, only connections that are bound to the session can be used to carry the requests and responses for that session. The connections 560 are bi-directional, allowing message flow in each direction. For example, requests flow in both directions on each session, thereby allowing forward (operational) and reverse (callback) flows to be sent through the same session. Responses for both directions of request flow are also carried in the session. Connections that are bound to sessions cannot be shared by multiple sessions; however, multiple sessions may be multiplexed onto a single connection. That is, operational and callback sessions between an N-blade/D-blade pair can be multiplexed onto a single connection. Sessions can also multiplex operations for different clients and different users.

Each session 600 is illustratively identified by a globally unique identifier (id) formed of the universal unique ids (UUIDs) of its two participant blades, with the session initiator's UUID listed first. The globally unique id is combined with a 64-bit uniquifier that is unique for all concurrent sessions between the pair of blades, regardless of which blade is the initiator, as well as for any dormant recoverable session for which any state is still stored on either of the two blades. The uniquifier may be generated using the current time, indicating the time of constructing a session initiation operation, i.e., SPINNP_CREATE_SESSION, conveyed within an appropriate request. The resulting session id uniquifier is then confirmed to be unique by the receiver blade. Note that the id uniquifier should be unique unless both blades are trying to create a session to each other simultaneously. If so, each blade can counter-propose a different session id, possibly by simply adding a small random number to the original proposed session id uniquifier.
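The session id construction described above can be sketched as follows. This is a minimal illustration only, not the SpinNP implementation: the names SessionId, propose_session_id and counter_propose, the nanosecond clock source and the range of the random increment are all assumptions introduced for the example.

```python
import random
import time
import uuid
from typing import NamedTuple, Optional, Set

MASK64 = (1 << 64) - 1  # the uniquifier is a 64-bit value


class SessionId(NamedTuple):
    """Hypothetical layout: initiator UUID first, then responder UUID, then uniquifier."""
    initiator: uuid.UUID
    responder: uuid.UUID
    uniquifier: int


def propose_session_id(initiator: uuid.UUID, responder: uuid.UUID,
                       now_ns: Optional[int] = None) -> SessionId:
    # Assumption: the "current time" is taken as a nanosecond timestamp
    # truncated to 64 bits; the text does not specify the clock or resolution.
    stamp = time.time_ns() if now_ns is None else now_ns
    return SessionId(initiator, responder, stamp & MASK64)


def counter_propose(proposed: SessionId, in_use: Set[int],
                    rng: Optional[random.Random] = None) -> SessionId:
    # The receiver confirms uniqueness against the uniquifiers it already
    # holds for this pair of blades; on a clash it adds a small random number.
    rng = rng or random.Random()
    candidate = proposed.uniquifier
    while candidate in in_use:
        candidate = (candidate + rng.randint(1, 255)) & MASK64
    return proposed._replace(uniquifier=candidate)
```

In this sketch the receiver returns the proposal unchanged when the uniquifier is not already in use, so a single call covers both the confirmation and the counter-proposal cases.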

In the illustrative embodiment, each connection 560 has an assigned priority level and each session 600 is bound to at least three connections, each of which is independently flow-controlled and has a different priority level. Illustratively, the connections include a high priority level connection 562, a medium priority level connection 564 and a low priority level connection 566. The priority level indicates the minimum priority of message that the connection will accept. To that end, each request has one of the three priority levels: high, medium and low. Every response is sent with the same priority as its request. Low priority is used for the vast majority of requests and, as such, each session may include multiple low priority connections 566. Medium priority is used for some callback requests. Callback requests are requests that flow in the reverse of the typical direction, e.g., from server to client. The medium priority callback requests are those requests that are issued to inform the client that it must take some action that will allow the server to free some resources or unblock a different client. Finally, high priority is reserved for requests that the client issues to fulfill the demands of a callback. SpinNP session operations can be performed at any priority.

E. SpinNP Channels

Each session comprises a plurality of channels disposed over the connections; unlike a session, a channel is not bound to the connections. FIG. 6 is a schematic block diagram illustrating channels 620 of a session 600 in accordance with an embodiment of the present invention. A channel 620 is a light-weight construct that enables multiple requests to be sent asynchronously over a connection 560. Each channel 620 is illustratively embodied as a request buffer (request window 630) capable of storing a plurality of in-flight requests. Within a session, the session layer 510 selects any request window 630 with available space to send a request, thereby obviating the possibility of one long-running or lost request (or response) blocking the progress (performance) of the session. Each request window 630 has a predetermined initial sequence window size and the total number of outstanding requests in a session is the sum of the window sizes of all the channels in the session.

Moreover, each channel 620 has an assigned priority level, e.g., high priority channel 622, medium priority channel 624 and low priority channel 626. Although this arrangement imposes a binding between channels and connections of a particular priority level, the requests for any number of channels at that priority level can be sent over any set of connections used to service that priority level. That is, any request from a channel 620 that is staged in a request window 630 can be sent over any connection 560, as long as the priority levels of the request, channel and connection are the same. Although a request is associated with a channel 620 of the session layer 510, this notion disappears at the connection layer 550 (and connections 560).

Notably, there is no mapping between channels and connections; e.g., requests within a channel 620 may be distributed among (sent over) different connections 560 of the same priority, primarily because the session layer 510 performs its own matching of request to response messages within various sessions. This aspect of the invention enables the SpinNP session layer 510 to multiplex (i.e., send) requests from channels 620 (request windows 630) over any connection 560 that is available at the proper priority level. Any messages delivered over a channel can be annotated at the receiver with the priority level, which can speed the processing of higher priority messages through the layers of processing at the receiver. Note that certain numbers of connections are always kept clear of low priority traffic to keep higher priority traffic from being delayed unnecessarily by low priority traffic; however, any connection can, in theory, carry any priority of request. It should be noted that a message sent over a channel of a given priority may be sent over any connection of that specified priority or lower. Thus, a message sent over a high priority channel may utilize a low, medium or high priority connection.
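The rule that a message may use any connection of its own priority or lower can be sketched as a simple filter. The Priority enumeration, the eligible_connections helper and the connection names used below are hypothetical and serve only to illustrate the rule; they are not taken from the SpinNP API.

```python
from enum import IntEnum
from typing import List, Tuple


class Priority(IntEnum):
    # Hypothetical numeric encoding; only the relative order matters here.
    LOW = 0
    MEDIUM = 1
    HIGH = 2


def eligible_connections(message_priority: Priority,
                         connections: List[Tuple[str, Priority]]) -> List[str]:
    """Return the connections that may carry a message of the given priority.

    Each connection's level is the minimum message priority it accepts, so a
    message qualifies for every connection whose level does not exceed its own.
    """
    return [name for name, level in connections if level <= message_priority]
```

Under this reading, a high priority message qualifies for all three kinds of connection, while a low priority message is confined to low priority connections, which keeps the higher priority connections clear of low priority traffic.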

Each session 600 illustratively contains a limited number of channels 620, defined during session negotiation. Initially, each channel 620 is opened with a sequence window size of one; however, the window size for any channel can be subsequently negotiated via a SPINNP_SET_SEQ_WINDOW_SIZE operation. The total number of outstanding requests in a session is the sum of the window sizes of all the channels in the session. This total is also negotiated at session creation and can be renegotiated at any time. Every time a channel's sequence window is resized, the new window size is counted against the total budget available to the session.
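The accounting of channel window sizes against a session-wide budget can be sketched as follows. The class SessionWindowBudget and its methods are assumptions made for illustration, and the policy of refusing a resize that would exceed the budget is one plausible reading; the text does not specify how an over-budget request is handled.

```python
class SessionWindowBudget:
    """Tracks per-channel sequence window sizes against a session total."""

    def __init__(self, total: int, channels: int) -> None:
        if channels > total:
            raise ValueError("every channel opens with a window size of one")
        self.total = total
        self.sizes = [1] * channels  # each channel opens with window size one

    def resize(self, channel: int, new_size: int) -> bool:
        """Apply a new window size if the session budget allows it."""
        if new_size < 1:
            raise ValueError("window size must be at least one")
        proposed = sum(self.sizes) - self.sizes[channel] + new_size
        if proposed > self.total:
            return False  # assumption: an over-budget resize is refused
        self.sizes[channel] = new_size
        return True

    def outstanding_limit(self) -> int:
        # The session-wide limit is the sum of all channel window sizes.
        return sum(self.sizes)
```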

Each channel 620 is identified by a channel number, which is unique within the direction of request flow in the session. In addition, each request has a sequence number that is guaranteed to be unique for that request and that specifies its sequence in the channel. Illustratively, the unique sequence number of each request is one greater than the sequence number of the request that immediately precedes it in the channel. In alternate embodiments, the sequence number may be decremented from the sequence number immediately preceding it. The use of unique sequence numbers for requests prevents reexecution of replayed or duplicated requests, and allows the detection of lost requests in a session. Sequence numbers in each channel wrap-around when the maximum sequence number is reached. The requester is generally required to issue all requests in a channel in strictly increasing order until wrap-around, without skipping any sequence numbers. At wrap-around, the sequence decreases from its maximum value to zero, then resumes its strictly increasing pattern, e.g., S(n)=n mod 2⁶⁴, where S(n) is the sequence number of the nth request sent on the channel.
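The wrap-around rule S(n)=n mod 2⁶⁴ can be illustrated with a short sketch. The function names are hypothetical, and the sketch assumes the illustrative incrementing embodiment with numbering that starts at zero.

```python
SEQ_MODULUS = 1 << 64  # sequence numbers are 64-bit values


def sequence_number(n: int) -> int:
    """Sequence number of the nth request on a channel, counting from zero."""
    return n % SEQ_MODULUS


def next_sequence(s: int) -> int:
    """Successor of sequence number s; the maximum value wraps to zero."""
    return (s + 1) % SEQ_MODULUS
```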

Moreover, each request is identified by a unique identifier (“request id”), which is placed in a request header of the request message. A request id is generally defined as the combination of a channel number and a sequence number. Each response includes the request id of its corresponding request in a response header of the response message. Requests are otherwise distinguished from responses by a protocol tag byte in the message header, so that each message in a session is guaranteed to be unique. Note that the session layer 510 does not depend upon ordering or identifying properties of the connections 560 to resolve the association of a request to a channel 620, or its sequence in that channel.

Windowing is used within each channel 620 to accomplish flow control, bounding the maximum number of outstanding requests per channel, and therefore the total maximum number of outstanding requests per session. Request windowing is defined by the combination of a per request sequence number and a sequence window maintained on the responder. Only requests that fall within the current window of the request channel are accepted for processing by the responder. Any requests outside of the current window are failed promptly with a SPINNP_ERR_BADSEQ response. The window of requests initially accepted starts at sequence number 0 and extends to the sequence number equal to that channel's sequence window size w minus 1. The window on the responder is only advanced when the responder sends the response to the oldest outstanding request (the one with the lowest sequence number). The window of sequence numbers that the requester is allowed to send is correspondingly advanced when it receives the response to the oldest outstanding request. The requester can then advance the window by the number of contiguously numbered responses that it has received at the tail of the window in that channel.

In other words, the responder advances the window of requests it will accept in a channel when it sends a response for the oldest outstanding request in the window. At any time, the maximum sequence number that can be accepted in a channel equals the lowest sequence number of any request that has not been responded to, plus w−1. The requester can send a request with sequence number (n+w) mod 2⁶⁴ when it receives the response for the request with sequence number n. Note that the sequence window affects the size of a response cache, if such a cache is kept. Response cache entries are preserved in the response cache until the responder receives confirmation that a response has been received. This confirmation is received implicitly for the request with sequence number n when the request with sequence number n+w is received, where w is the window size.
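The responder-side window described above can be sketched as follows. The class ResponderWindow is a hypothetical model, not the SpinNP implementation; it tracks only window membership and advancement, and omits duplicate detection within the window and the response cache.

```python
SEQ_MODULUS = 1 << 64


class ResponderWindow:
    """Sliding window of acceptable sequence numbers for one channel."""

    def __init__(self, window_size: int, first_seq: int = 0) -> None:
        self.w = window_size
        self.oldest = first_seq  # lowest sequence number not yet responded to
        self.responded = set()   # responded out of order, ahead of `oldest`

    def accepts(self, seq: int) -> bool:
        # Inside the window means at most w-1 past the oldest outstanding
        # request, computed modulo 2**64 so that wrap-around is handled.
        # A request failing this test would be answered with SPINNP_ERR_BADSEQ.
        return (seq - self.oldest) % SEQ_MODULUS < self.w

    def respond(self, seq: int) -> None:
        """Record that a response was sent and slide the window if possible."""
        if not self.accepts(seq):
            raise ValueError("sequence number outside the current window")
        self.responded.add(seq)
        # The window advances only past contiguously responded requests,
        # beginning with the oldest outstanding one.
        while self.oldest in self.responded:
            self.responded.remove(self.oldest)
            self.oldest = (self.oldest + 1) % SEQ_MODULUS
```

With a window size of four, responding to request 1 before request 0 leaves the window at 0 through 3; once request 0 is answered, the window slides to 2 through 5 in one step.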

Connections 560 can also be unbound from a session 600, which is generally performed during the process of closing a connection. Unbinding a connection from a session ensures that the connection is flushed of all outstanding requests and responses. All but one connection can be unbound from a session at a time without destroying the session to which it is bound. Unbinding the connection from a session does not cause the termination of the session. An abandoned session will eventually time itself out and terminate. However, a session that is reconnected before the timeout period expires does not lose its session state or identity. A connection can buffer and queue requests and responses, but it is expected to deliver complete messages to a SpinNP target as quickly as possible.

Specifically, a session 600 is closed by a SPINNP_CLOSE_SESSION operation, which also unbinds the last connection in the session. Individual connections can be disassociated from a session by a SPINNP_UNBIND_CONNECTION operation. Session termination unbinds all connections in the session. Safe termination of a session requires that all requests in the connections are delivered, and all the matching responses are received before the connections are unbound. Immediate termination of a session unbinds the connections without guaranteeing delivery of outstanding requests or responses. The SPINNP_CLOSE_SESSION operation takes an enumerator argument to specify the manner in which connections are unbound in the session. Immediate session termination should only be used in the event of a failure where rapid recovery is needed, or in the event of an immediate need to remove a node from the cluster.

F. Batch Execution Ordering

The present invention is directed to a system and method for specifying batch execution ordering of requests in a cluster of nodes. The strict sequence numbering of requests in each channel provides a capability of defining the ordering of request execution within the channel. According to an aspect of the invention, the request id is extended to include a batch number that provides an execution ordering directive within a channel. That is, each request is also assigned a batch number used to impose ordering of execution of the request within the channel. All requests with the same batch number in a channel can be executed in arbitrary order or concurrently by the responder. Any requests that have different batch numbers in the same channel are executed in order of ascending batch number. Illustratively, requests within different channels may be executed in an arbitrary order with respect to each other.

Any number of contiguous requests (i.e., requests with a contiguous set of sequence numbers) in a channel can be issued with the same batch number. Ordering is imposed only when the batch number changes, e.g., increases. Illustratively, the batch number increases monotonically in order of increasing sequence number, such that B(s1)>=B(s2) if s1>s2 where s1 and s2 are sequence numbers and B(s) is the batch number of the request with sequence number s. Moreover, the batch number illustratively increases only in increments of one, e.g., either B(n+1)=B(n) or B(n+1)=(B(n)+1) mod 2³², where B(n) is the batch number of the nth request sent on a channel. Although more than one request in a channel can have the same batch number, all requests with the same batch number B are executed before any request with batch number B+1 or higher.
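The increment rule for batch numbers can be checked mechanically. The following sketch, with hypothetical function names, tests whether a series of batch numbers listed in sequence-number order obeys B(n+1)=B(n) or B(n+1)=(B(n)+1) mod 2³².

```python
from typing import Sequence

BATCH_MODULUS = 1 << 32  # batch numbers are 32-bit values


def valid_batch_step(prev: int, nxt: int) -> bool:
    """True if nxt repeats prev or exceeds it by one, modulo 2**32."""
    return nxt == prev or nxt == (prev + 1) % BATCH_MODULUS


def valid_batch_sequence(batches: Sequence[int]) -> bool:
    """Check every adjacent pair of batch numbers in sequence-number order."""
    return all(valid_batch_step(a, b) for a, b in zip(batches, batches[1:]))
```

For example, the series 1, 1, 1, 1, 1, 2, 3 satisfies the rule, whereas a jump from 1 directly to 3 does not.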

In the illustrative embodiment, the batch number is a 32-bit value, allowing window sizes to be effectively unlimited (maximum of 2³²−1). The number of requests in a channel is generally limited to a sequence window size, with the outstanding requests having sequence numbers that fall within the range of the sequence window of each other. In addition, the magnitude of the batch numbers is large enough such that the numbers cannot wrap-around within the sequence window, i.e., bmax>seq_window. Nevertheless, the batch number can wrap-around independently of the sequence number. That is, batch numbers and sequence numbers can wrap-around independently in a binary numbering scheme.

FIG. 7 is a schematic block diagram illustrating the use of batch numbers within a channel of a session in accordance with the present invention. Each channel 620 is illustratively embodied as a request window 630 within the session layer 510 a (e.g., at a requester blade/element) and a response window 640 within session layer 510 b (e.g., at a responder blade/element). Each window 630, 640 has a sequence number range for storing outstanding requests sent over a connection 560; each request is identified by a unique request id 700:

Request ID 700 = Channel Number 710 + Sequence Number 720 + Batch Number 730

wherein (i) the channel number 710 specifies the channel 620 over which the request is sent from, e.g., an N-blade 310 to a D-blade 350, (ii) the sequence (seq) number 720 specifies the sequence of that request within the channel and (iii) the batch number 730 specifies the ordering imposed on that request within the channel. The request (req) id 700 thus specifies the order in which requests are sent over the channel between the blades in the cluster.

As noted, requests (i.e., Req ID 700) having the same batch number 730 within a channel can be executed at a responder (e.g., D-blade 350) in any order. For example, requests with seq numbers 1-5 can be executed in any order because they are all associated with batch number 1. However, execution of each of those requests must be completed before the request with seq number 6 can be executed because the latter request is associated with a different batch number, e.g., batch number 2. Similarly, execution of the request with seq number 6 must be completed before the request with seq number 7 can be executed because that later request is associated with batch number 3.

According to another aspect of the invention, the responder does not execute a request associated with a different batch number until it identifies a transition or boundary between an immediately preceding batch number and a next batch number, and determines that all intervening requests associated with the preceding batch number have been completed. In this context, a “boundary” may be defined as the point at which the preceding seq number s in the preceding batch number B(n) moves to the next seq number s+1 in the next batch number B(n+1). A key to the operation of batch numbering is that boundaries between adjacent batches can be identified with complete certainty, since the sequence numbers 720 establish an exact order in which the requests are issued, regardless of their order of arrival at the responder. Once the first request in a batch is identified and all requests in the immediately preceding batch have been executed, any requests in the next (current) batch that have been received by the responder can be executed, even if the entire batch has not yet been seen. The responder maintains a current batch index, and any request arriving with that batch number can be dispatched immediately. Any request with a higher batch number is delayed until the transition from the previous batch number to the new batch number is observed in a pair of requests that have adjacent sequence numbers, and all requests in the previous batch have been received and processed.

Batch numbering can be used to achieve several different ordering behaviors within a channel. For example, a completely unordered set of requests can be sent on a channel by issuing all the requests with the same batch number. Such un-ordered behavior can extend indefinitely, although the number of outstanding requests at any one time is always limited by the size of the sequence window. In addition, a strictly ordered sequence of requests can be issued with strictly increasing batch numbers, incremented by one each time. Furthermore, a mixture of ordered and unordered operations can be sent on a channel. As an example, a requester may first lock a byte range of a file, then perform multiple unordered I/O operations to that byte range.

A common usage of batch execution ordering involves SCSI protocol processing, wherein barrier operations are inserted into a channel of requests that is otherwise unordered within arbitrarily large groups of requests. All operations occurring prior to the barrier must be completed before any operations after the barrier are executed. According to the invention, ordering can be achieved by incrementing the batch number when a barrier is encountered. Batch numbering of requests further allows the benefits of explicit request ordering controls, while also allowing request chaining (as in DAFS) without depending on in-order message delivery. This feature of the invention offers the benefits of NFSv4 compound without its extra layer of request encapsulation.
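The translation of barriers into batch number increments can be sketched as follows. The BARRIER marker and the batch_numbers_for function are hypothetical; the sketch assumes that a barrier is a marker in the stream rather than a request in its own right, and that consecutive barriers collapse into a single increment so that batch numbers still rise by one.

```python
from typing import Any, Iterable, List, Tuple

BARRIER = object()  # hypothetical marker standing in for a barrier operation


def batch_numbers_for(stream: Iterable[Any],
                      first_batch: int = 0) -> List[Tuple[Any, int]]:
    """Pair each operation with a batch number, incrementing at barriers."""
    batch = first_batch
    used = False  # whether the current batch already holds a request
    numbered = []
    for op in stream:
        if op is BARRIER:
            if used:  # an empty batch is never opened
                batch = (batch + 1) % (1 << 32)
                used = False
        else:
            numbered.append((op, batch))
            used = True
    return numbered
```

Operations between two barriers share a batch number and may therefore execute in any order, while everything before a barrier completes before anything after it begins.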

FIG. 8 is a flowchart illustrating a procedure 800 for specifying batch execution ordering of requests in accordance with an embodiment of the present invention. The procedure 800 illustrates the steps performed by a requester originating a series of requests. The procedure starts in step 805 and continues to step 810 where the requester initializes the sequence numbers to be utilized. Then, in step 815, the requester initializes the batch numbers to be utilized. This initialization of sequence and batch numbers may be accomplished by starting the sequence and batch numbers from predetermined values, e.g., zero. In step 820, a sequence number and a batch number are assigned to a request, which is then sent to the destination (responder) in step 825. The requester then, in step 830, determines whether it has completed the current batch. If it has completed the current batch, the requester branches to step 835 and increments the batch number before continuing to step 840. However, if the batch has not been completed, the requester branches from step 830 to step 840. In step 840, the requester increments the sequence number. The requester then determines, in step 845, whether there are additional requests. If there are no additional requests, the procedure ends in step 850. However, if there are additional requests, the procedure branches back to step 820 and the next request is assigned the newly incremented sequence number and batch number.
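The requester-side numbering of procedure 800 can be sketched as follows. The function assign_ids and the representation of batches as nested lists are assumptions made for illustration; the step numbers in the comments refer to FIG. 8, and sending (step 825) is modeled by appending to a list.

```python
from typing import Any, List, Sequence, Tuple


def assign_ids(batches: Sequence[Sequence[Any]], first_seq: int = 0,
               first_batch: int = 0) -> List[Tuple[int, int, Any]]:
    """Return (sequence number, batch number, request) triples in issue order."""
    seq, batch = first_seq, first_batch          # steps 810 and 815
    sent = []
    for group in batches:
        if not group:
            continue                             # skip empty batches so numbers rise by one
        for request in group:
            sent.append((seq, batch, request))   # steps 820 and 825
            seq = (seq + 1) % (1 << 64)          # step 840
        batch = (batch + 1) % (1 << 32)          # step 835, once the batch is complete
    return sent
```

Applied to the earlier byte-range example, a lock request issued alone, followed by a group of writes, followed by an unlock request, yields batch numbers 0, 1, 1, 2 under consecutive sequence numbers.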

FIGS. 9A and 9B are flowcharts illustrating a procedure 900 for processing received requests including batch numbers by a responder in accordance with an embodiment of the present invention. The procedure 900 begins in step 905 and proceeds to step 910 where the responder initializes the current sequence number. Then, in step 915, the responder initializes the current batch number. The responder then receives a request in step 920. In step 925, the responder determines if the sequence number of the request is within an acceptable range. The acceptable range is illustratively the window size. For example, if the window size is 10 and the current sequence number is 70, only those messages with sequence numbers 70-79 are within the window. If the sequence number is within the acceptable range, the responder then determines, in step 930, whether the request sequence number has already been utilized. If the answer is negative for step 925 or affirmative for step 930, the responder branches to step 935 and returns a rejection message.

However, if the sequence number is in the appropriate range and the sequence number has not previously been utilized, the responder then marks the sequence number as used in step 940. The responder then determines, in step 945, whether the batch number associated with the request equals the current batch number. If the batch numbers match, the responder continues to step 1000, where the request is performed. Step 1000 is described in further detail below in reference to FIG. 10. Once the request is performed, the responder determines, in step 955, whether there are additional requests. If there are no additional requests, the procedure ends in step 960. However, if, in step 955, it is determined that there are additional requests, the responder loops back to step 920 to receive the next request.

If, in step 945, it is determined that the batch number associated with the request does not equal the current batch number, the responder branches to step 965, where it determines if the request's batch number equals the current batch number plus one. If it does not, the responder branches to step 985, where it enqueues the request for later processing before determining, in step 990, whether there are additional requests. If there are no additional requests, the procedure ends in step 995. However, if there are additional requests, the procedure loops back to step 920.

If, in step 965, it is determined that the request's batch number equals the current batch number plus one, the responder continues to step 970, where a determination is made whether all requests up to the sequence number have been received. If so, the batch number is incremented in step 975 and all enqueued requests with the new batch number are performed in step 980. The responder then continues to step 1000 to perform the current request. If, in step 970, it is determined that all requests up to the sequence number have not been received, the responder branches to step 985 and enqueues the request as described above.
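The batch-number decisions of steps 940 through 985 can be sketched as below. This is a simplified illustration under stated assumptions: the class name Responder and its fields are hypothetical, requests are performed synchronously (so "received" and "performed" coincide for the current batch), sequence numbers start at zero, and the window check of steps 925 and 930 is omitted.

```python
from collections import defaultdict


class Responder:
    """Hypothetical responder following the batch checks of procedure 900."""

    def __init__(self) -> None:
        self.current_batch = 0
        self.received = set()              # sequence numbers seen so far
        self.pending = defaultdict(list)   # batch number -> queued sequence numbers
        self.performed = []                # order in which requests were performed

    def _perform(self, seq: int) -> None:
        # Stand-in for step 1000 (e.g., passing the request to the file system).
        self.performed.append(seq)

    def handle(self, seq: int, batch: int) -> None:
        self.received.add(seq)                                   # step 940
        if batch == self.current_batch:                          # step 945
            self._perform(seq)
        elif (batch == self.current_batch + 1                    # step 965
              and all(s in self.received for s in range(seq))):  # step 970
            self.current_batch += 1                              # step 975
            for queued in self.pending.pop(batch, []):           # step 980
                self._perform(queued)
            self._perform(seq)
        else:
            self.pending[batch].append(seq)                      # step 985
```

In this sketch, if requests (sequence, batch) arrive in the order (0, 0), (2, 1), (3, 1), (1, 0), (4, 1), the two early batch 1 requests are enqueued because request 1 is still missing, and they are performed once request 4 arrives and all earlier sequence numbers have been received.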

FIG. 10 is a flowchart illustrating a procedure 1000 for performing the request in accordance with an embodiment of the present invention. The procedure 1000 begins in step 1005 and continues to step 1010, where the request is processed. This may be accomplished by, for example, passing the operations to the file system for processing. The request's response is then sent in step 1015. The response may comprise a status indicator or, in the case of a read operation, the requested data. Then, in step 1020, a determination is made whether the request's sequence number equals the current sequence number. If they are not equal, the responder branches to step 1035 and ends. However, if they are equal, the sequence window may then be propagated forward, as the oldest sequence number has been processed. As such, the procedure then increments the current sequence number in step 1025 before deciding, in step 1030, whether the current sequence number has already had a response sent. If a response has not already been sent, the procedure then ends in step 1035. However, if a response has already been sent, the procedure loops back to step 1025 and further increments the current sequence number.
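The forward propagation of the sequence window in steps 1020 through 1030 can be sketched as follows. The function name advance_window and the set of sequence numbers for which a response has been sent are assumptions made for illustration.

```python
def advance_window(current_seq: int, responded: set, completed_seq: int) -> int:
    """Return the new current sequence number after a response is sent.

    responded holds sequence numbers whose responses have already been sent.
    """
    responded.add(completed_seq)        # step 1015: response sent
    if completed_seq != current_seq:    # step 1020: not the oldest request
        return current_seq
    current_seq += 1                    # step 1025
    while current_seq in responded:     # step 1030: skip numbers already answered
        current_seq += 1
    return current_seq
```

For example, if the current sequence number is 70 and responses are sent for 72, then 71, then 70, the window stays at 70 after the first two responses and moves to 73 after the third.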

Advantageously, batch execution ordering allows multiple requests to be executed concurrently or out of sequence, while explicitly requiring ordering among subsets of requests. That is, the use of batch numbers within a channel allows imposition of an ordering constraint on requests in the channel, as well as issuance of multiple unordered requests in the channel. Layering of a batch number on a request ID allows immediate and certain detection of a boundary between batches with no danger of error. In other words, the batch number enables a responder to determine whether a request can be immediately executed or must be stalled, and this determination can always be made optimally based on the requests received at that point.

Moreover, batch numbering allows a client to specify a precise ordering of batches of requests of any size with respect to each other. This provides a solution to constraints imposed on network protocols by SCSI, NFS, CIFS and any arbitrary protocol that may require ordering of request execution, while retaining the benefits of flow control, resource constraining and immunity to long-running requests provided by multiple channels and per-request sequence numbers with predetermined sequence windows. Strict ordering is possible simply by incrementing the batch number by one for every request sent. Completely unordered execution is possible by sending all requests with the same batch number. Any intermediate level of ordering is possible, including sending a stream of unordered requests with the knowledge that some future request may need to be ordered, but without knowing how many requests need to be issued before the request requiring ordering is issued.
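The three ordering levels described above can be illustrated by the batch numbers a requester would assign. The function names below are hypothetical, and the with_barriers sketch assumes that a request requiring ordering simply opens a new batch, with later requests sharing that batch until the next such request.

```python
from typing import List, Sequence, Tuple


def strict(n: int) -> List[Tuple[int, int]]:
    """Strict ordering: the batch number increases by one for every request."""
    return [(seq, seq) for seq in range(n)]


def unordered(n: int) -> List[Tuple[int, int]]:
    """Completely unordered execution: every request carries the same batch number."""
    return [(seq, 0) for seq in range(n)]


def with_barriers(must_follow: Sequence[bool]) -> List[Tuple[int, int]]:
    """Intermediate ordering: must_follow[i] marks a request that has to run
    after all earlier requests, so it starts a new batch (an assumption)."""
    batch = 0
    numbered = []
    for seq, ordered in enumerate(must_follow):
        if ordered and seq > 0:
            batch += 1
        numbered.append((seq, batch))
    return numbered
```

Each pair is (sequence number, batch number). The requester in with_barriers does not need to know in advance how many unordered requests precede the ordered one; it only increments the batch number when that request is issued.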

Batch ordering further provides a substantial improvement over the ordering mechanism in NFS and improves upon the ordering mechanism in DAFS, while supporting the type of ordering needed to achieve an efficient implementation of SCSI in a client/server model. The novel ordering capability provided by the batch numbers is provided at little cost in either requester/responder endpoint of the session. Both endpoints maintain a current batch number and the responder enqueues requests that are from a higher batch than the current batch number. However, the number of such requests in a channel is limited by the sequence number window size.

The foregoing description has been directed to particular embodiments of this invention. It will be apparent, however, that other variations and modifications may be made to the described embodiments, with the attainment of some or all of their advantages. Specifically, it should be noted that the principles of the present invention may be implemented in non-distributed file systems. Furthermore, while this description has been written in terms of N and D-blades or elements, the teachings of the present invention are equally suitable to systems where the functionality of the N and D-blades is implemented in a single system. Alternately, the functions of the N and D-blades may be distributed among any number of separate systems, wherein each system performs one or more of the functions. Additionally, the procedures, processes, layers and/or modules described herein may be implemented in hardware, software, embodied as a computer-readable medium having program instructions, firmware, or a combination thereof. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.

What is claimed is:
 1. A computer data storage system apparatus, comprising: a plurality of requests received from a client, each request of the plurality of requests having assigned a unique sequence number, each request being an input/output request to a data storage device; a plurality of subsets of requests formed by dividing the plurality of requests into subsets; a unique batch number assigned to each subset of requests; a processor to execute a first subset of requests having a first batch number in arbitrary order with respect to the sequence number of each request; and the processor to execute a second subset of requests having a second batch number in arbitrary order with respect to the sequence number of each request, after execution of all of the first subset of requests, having the first batch number, have completed, where execution of the second subset of requests, further comprises: the processor further configured to receive a particular request having a different batch number than the first batch number and the second batch number, enqueue the particular request in response to the different batch number of the particular request not being the second batch number plus one, and perform at least one of enqueuing the particular request and processing the particular request in response to the different batch number of the particular request being the second batch number plus one.
 2. The apparatus as in claim 1, further comprising: an optical storage device used as the data storage device.
 3. The apparatus as in claim 1, further comprising: a magnetic tape used as the data storage device.
 4. The apparatus as in claim 1, further comprising: a bubble memory used as the data storage device.
 5. The apparatus as in claim 1, further comprising: an electronic memory used as the data storage device.
 6. The apparatus as in claim 1, further comprising: a micro-electro mechanical device used as the data storage device.
 7. The apparatus as in claim 1, further comprising: a media configured to store information used as the data storage device.
 8. A method for operating a computer data storage system, comprising: receiving a plurality of requests from a client, each request of the plurality of requests having assigned a unique sequence number, each request being an input/output request to a data storage device; dividing the plurality of requests into a plurality of subsets of requests; assigning a unique batch number to each subset of requests so that each subset of requests is assigned a unique batch number; and using the batch number as an execution ordering directive so that a plurality of requests having the same batch number are executed before a plurality of requests having a second batch number, and execution of the requests with the same batch number is arbitrary of the sequence number of the requests, wherein during the execution of a current batch number, a particular request is received and associated with a different batch number that is different than the current batch number being executed, and the particular request is at least one of enqueued or processed based on the different batch number being the current batch number plus one.
 9. The method as in claim 8, further comprising: using an attached array of writable storage device media as the data storage device.
 10. The method as in claim 8, further comprising: using an optical storage device as the data storage device.
 11. The method as in claim 8, further comprising: using a magnetic tape as the data storage device.
 12. The method as in claim 8, further comprising: using a bubble memory as the data storage device.
 13. The method as in claim 8, further comprising: using an electronic memory as the data storage device.
 14. The method as in claim 1, further comprising: using a micro-electro mechanical device as the data storage device.
 15. The method as in claim 8, further comprising: using a media configured to store information as the data storage device.
 16. A computer readable medium containing executable program instructions executed by a processor, comprising: program instructions that receive a plurality of requests from a client, each request of the plurality of requests having assigned a unique sequence number, each request being an input/output request to a data storage device; program instructions that divide the plurality of requests into a plurality of subsets of requests; program instructions that assign a unique batch number to each subset of requests so that each subset of requests is assigned a unique batch number; and program instructions that use the batch number as an execution ordering directive so that a plurality of requests having the same batch number are executed before a plurality of requests having a second batch number, and execution of the requests with the same batch number is arbitrary of the sequence number of the requests, wherein during the execution of a current batch number, a particular request is received and associated with a different batch number that is different than the current batch number being executed, and the particular request is at least one of enqueued or processed based on the different batch number being the current batch number plus one.